Executive Summary


The project goals were to develop automated techniques and tools for analysis and reverse
engineering of highly obfuscated malware code. The project made significant progress in this
regard. Two very different kinds of malware were considered; because of the different nature of
the malicious code in each case, fundamentally different techniques were employed. The two kinds
of malware, and specific accomplishments for each, are listed below.

1.  Web-delivered  malware,  which  is  typically  in  the  form  of  (obfuscated)  JavaScript  programs. 
This kind of malware is currently among the most common ways for infections to occur.

Specific accomplishments include:

(a)  Development  of  novel  techniques  for  reverse  engineering  obfuscated  JavaScript  code  [2]. 

(b)  Identification  of  weaknesses  in  existing  techniques  for  detecting  web-borne  malware  [3,  5]. 

(c) Development of client-side defenses against trigger-based web-based malware [4].

2. Native executables, i.e., machine code programs that execute natively. Regardless of
whether an infection happened through the web via JavaScript code or not, in the end the
malicious actions are typically carried out by native executables.

Specific  accomplishments  include: 

(a)  Development  of  improved  techniques  for  information  flow  analysis  in  software  [8]. 

(b)  Generic  techniques  for  deobfuscation  of  executable  code  [9]. 

(c)  Information  flow  based  dynamic  analysis  techniques  for  identifying  self-checksum-based 
anti-tamper  defenses  in  software  [6,  7]. 

The  project  led  to  two  PhD  dissertations: 

1.  Gen  Lu.  Analysis  of  Evasion  Techniques  in  Web-Based  Malware.  PhD  Dissertation,  The 
University  of  Arizona,  Jan.  2014. 

This  dissertation  focused  on  analysis  of  web-based  malware.  The  significant  accomplishments 
of  this  work  are  listed  above. 




2.  Babak  Yadegari.  A  Generic  Approach  to  Deobfuscation.  PhD  Dissertation,  The  University 
of  Arizona,  May  2016  (expected). 

This  dissertation  focused  on  analysis  of  native  malware.  The  significant  accomplishments  of 
this  work  are  listed  above. 

Software  and  data  samples  resulting  from  the  project  are  available  to  the  research  community  at 

http://www.cs.arizona.edu/projects/lynx/Samples/.


1  Project  Objectives 


This  project  aimed  to  address  the  lack  of  automated  tool  support  for  malware  analysis  by  developing 
a  general  framework  and  techniques  to  automate  much  of  the  task  of  deobfuscating  malware  binaries 
and  thereby  dramatically  speed  up  the  process  of  understanding  malware  code.  This  goal  was  to 
be  attained  through  two  main  objectives:  first,  the  development  of  semantics-based  techniques  for 
identifying  and  removing  obfuscation  code;  and  second,  the  synthesis  of  simplification  techniques 
to  transform  the  resulting  low-level  machine  code  to  program  representations  that  are  easier  to 
reason  about  and  understand. 


2  Accomplishments/New  Findings 

The project focused on advanced semantics-based techniques to understand the behavior of
obfuscated code, in particular code that may have been obfuscated in various ways to resist
analysis, possibly using obfuscations that we do not know about and therefore cannot anticipate.
In the context of this focus, the project looked at three main topics: analysis of web-borne
malware, in particular drive-by downloads from infected web pages; analysis of executables armored
with various static and dynamic anti-analysis defenses; and foundational topics in semantics-based
malware analysis.


1. Analysis of Web-Borne Malware. Web-based mechanisms, often mediated by malicious
JavaScript code, play an important role in malware delivery today, making defenses against
web-borne malware crucial for system security. This work investigates improved techniques for
defending against web-borne malware.

1.  Novel  techniques  for  reverse  engineering  obfuscated  JavaScript  code  [2]. 

JavaScript is a scripting language that is commonly used to create sophisticated interactive
client-side web applications. It can also be used to carry out browser-based attacks on users.
Malicious JavaScript code is usually highly obfuscated, making detection a challenge. This
work describes a simple approach to deobfuscation of JavaScript code based on dynamic analysis
and slicing. Experiments using a prototype implementation indicate that the approach
described is able to penetrate multiple layers of complex obfuscations and extract the core
logic of the computation.

Figure 1 shows a fragment of obfuscated JavaScript code we extracted from a malicious web
page along with the deobfuscated code obtained automatically from it using our techniques.
The original malware sample goes through three execution contexts. Context 1 resides in the
web page opened by the user's web browser; it is a small piece of obfuscated JavaScript code
that, when executed, invokes the document.write() method to dynamically insert a hidden iFrame
and causes an external web page to be loaded; in this case it came from a hacked web page in
Germany. This newly loaded web page contains more obfuscated code (context 2). Context 2 then
causes another level of code unfolding using eval() and generates context 3, which is the
intended payload: this context uses a dynamically created hidden iFrame to open a PDF file,
hosted on a machine in China, that exploits a vulnerability in Adobe Reader. The final recovered
code is very close to what one might obtain if deobfuscating the malicious code manually;
importantly, the intermediate steps involving web page redirections through dynamic iFrames are
removed during the deobfuscation process, leaving only the essence of the malicious actions.

(a) Original obfuscated JavaScript (listing not reproduced)

(b) Deobfuscated JavaScript

    var0 = 0;
    while (var0 < navigator.plugins.length) {
        var1 = navigator.plugins[var0].name;
        if (var1.indexOf("Adobe Reader") != -1)
            document.write("<iframe src='./f3256c.pdf' width='1' height='1' frameborder=0></iframe>");
        var0++;
        continue;
    }

Figure 1: Semantics-based deobfuscation of malicious JavaScript sample
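The role that slicing plays in this kind of deobfuscation can be illustrated with a minimal sketch. The report does not spell out the algorithm, so everything here is an assumption made for illustration: the trace format, the field names (id, def, uses), and the backwardSlice helper are hypothetical. The sketch records which variable each executed statement defines and uses, then walks backwards from an observable action (a call to document.write) and keeps only the statements that action depends on.

```javascript
// Hypothetical execution trace: one entry per executed statement, with the
// variable it defines (or null) and the variables it reads.
const jsTrace = [
  { id: 1, text: 'k = 7',               def: 'k',    uses: [] },
  { id: 2, text: 'junk = k * 31',       def: 'junk', uses: ['k'] },
  { id: 3, text: 's = decode(blob, k)', def: 's',    uses: ['blob', 'k'] },
  { id: 4, text: 'junk = junk ^ 0x55',  def: 'junk', uses: ['junk'] },
  { id: 5, text: 'document.write(s)',   def: null,   uses: ['s'] },
];

// Backward dynamic slice: starting at the criterion statement, keep every
// earlier statement that defines a value the criterion transitively needs.
function backwardSlice(trace, criterionId) {
  const needed = new Set(); // variables whose defining statement is still sought
  const kept = [];
  let started = false;
  for (let i = trace.length - 1; i >= 0; i--) {
    const ev = trace[i];
    if (!started) {
      if (ev.id !== criterionId) continue;
      started = true;
      ev.uses.forEach((v) => needed.add(v));
      kept.push(ev);
      continue;
    }
    if (ev.def !== null && needed.has(ev.def)) {
      needed.delete(ev.def);
      ev.uses.forEach((v) => needed.add(v));
      kept.push(ev);
    }
  }
  return kept.reverse();
}

const sliceIds = backwardSlice(jsTrace, 5).map((ev) => ev.id);
console.log(sliceIds);
```

In this toy trace the two statements that only update junk are dropped, which mirrors (in a very simplified way) how code that does not contribute to the observable behavior can be discarded.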

2.  Identification  of  weaknesses  in  existing  techniques  for  detecting  web-borne  malware  [3,  5]. 

This work explores weaknesses in existing approaches to the detection of malicious JavaScript
code. These approaches generally fall into two categories: lightweight techniques focusing on
syntactic features such as string obfuscation and dynamic code generation; and heavier-weight
approaches that look for deeper semantic characteristics such as the presence of shellcode-like
strings or execution of exploit code. We show that each of these approaches has its
weaknesses, and that state-of-the-art detectors using these techniques can be defeated using
cloaking techniques that combine emulation with dynamic anti-analysis checks.

Figure 2 shows the high-level architecture of software using these cloaking techniques. We
used three malware detectors, covering a wide spectrum of detection technologies, for our
experiments: VirusTotal, an online portal to a collection of anti-virus software with up-to-date
exploit databases that exemplifies current commercial malware detection technology; Zozzle,
a static detector based on machine learning; and Wepawet, a hybrid detection system based
on JSAND that represents a state-of-the-art combination of static and dynamic analyses.
These three detectors, which range from traditional signature matching to state-of-the-art
static and dynamic analyses, represent the current state of detection techniques. None of these
detectors was able to penetrate the cloaking technique described and identify potentially
malicious content embedded within the programs.

3. Client-side defenses against trigger-based web-based malware [4].

Web-based malware tends to be environment-dependent, which poses a significant challenge
for defending against web-based attacks, because the malicious code (which may be exposed and
activated only under specific environmental conditions, such as the version of the browser)
may not be triggered during analysis. This work proposes a simple approach for defending
against environment-dependent malware. Instead of increasing analysis coverage in the detector,
the goal of this technique is to ensure that the client will take the same execution path as the
one examined by the detector. The technique is designed to work alongside a detector. It can
handle cases that existing multi-path exploration techniques cannot, and it provides an
efficient way to identify discrepancies between a JavaScript program's execution behavior in a
user's environment and its behavior in a sandboxed detector, thereby detecting false
negatives that may have been caused by environment dependencies. Experiments show that
this technique can effectively detect environment-dependent behavior discrepancies of various
forms, including those seen in real malware.
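One way to picture the "same execution path" check is sketched below. This is not the mechanism from [4], which the report does not detail; the recorder object, the branch site identifiers, and the runScript stand-in are all hypothetical. The sketch assumes that environment-dependent branches are instrumented so that their outcomes can be recorded once in the sandboxed detector and once on the client, and that a mismatch between the two records signals that the client is executing code the detector never examined.

```javascript
// Hypothetical instrumentation hook: records the outcome of each
// environment-dependent branch in the order it is evaluated.
function makeRecorder() {
  const outcomes = [];
  return {
    branch(siteId, condition) {
      const taken = Boolean(condition);
      outcomes.push(siteId + ':' + (taken ? 'T' : 'F'));
      return taken;
    },
    path() {
      return outcomes.join(',');
    },
  };
}

// Stand-in for an instrumented script whose behavior depends on the browser.
function runScript(env, rec) {
  const actions = [];
  if (rec.branch('b1', env.userAgent.includes('MSIE 6'))) {
    actions.push('load-hidden-iframe');
  } else {
    actions.push('show-page');
  }
  return actions;
}

// Path observed in the sandboxed detector (assumed user agent string).
const sandboxRec = makeRecorder();
runScript({ userAgent: 'AnalysisSandbox/1.0' }, sandboxRec);

// Path observed on the client.
const clientRec = makeRecorder();
runScript({ userAgent: 'Mozilla/4.0 (compatible; MSIE 6.0)' }, clientRec);

// A mismatch means the detector's verdict does not cover the client's run.
const pathsAgree = sandboxRec.path() === clientRec.path();
console.log(sandboxRec.path(), clientRec.path(), pathsAgree);
```

The design point this illustrates is that the client does not need to re-analyze the script; it only needs to compare a compact record of branch outcomes against the one produced during detection.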


Figure  2:  Code  architecture  to  bypass  (existing)  defenses  against  web-borne  malware 


Key:

(a) Original program

(b) Obfuscated program

(c) Deobfuscation result: traditional byte-level taint analysis

(d) Deobfuscation result: bit-level analysis (taintedness information only)

(e) Deobfuscation result: enhanced bit-level analysis (taintedness + taint source information)
(our algorithm)

Figure 3: Impact of different taint analysis algorithms on quality of deobfuscation (input program:
binary search; obfuscated using ExeCryptor)


2. Analysis of Obfuscation and Anti-Analysis Defenses in Executable Programs.
Malicious software is usually armored in various ways to avoid detection and resist analysis. When
new malware is encountered, such anti-analysis defenses have to be penetrated in order to
understand the internal logic of the code and devise countermeasures. This work investigates various
automatic and general-purpose ways of defeating anti-analysis and code obfuscation defenses.

1.  Improved  techniques  for  information  flow  analysis  in  software  [8]. 

Taint analysis has a wide variety of applications in software analysis, making the precision
of taint analysis an important consideration. Current taint analysis algorithms, including
previous work on bit-precise taint analyses, suffer from shortcomings that can lead to
significant loss of precision (under/over tainting) in some situations. This work explores these
limitations of existing taint analysis algorithms, shows how they can lead to imprecise taint
propagation, and describes a generalization of current bit-level taint analysis techniques to
address these problems and improve their precision. Experiments using a deobfuscation tool
indicate that our enhanced taint analysis algorithm leads to significant improvements in the
quality of deobfuscation.

Figure 3 illustrates the improvement in the quality of reverse-engineering obfuscated
malicious code using our algorithm compared to other taint analysis algorithms described in the
literature.
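The difference between byte-level and bit-level taint tracking can be made concrete with a small sketch. It shows only plain bit-level taintedness for an AND followed by a shift, not the enhanced algorithm of [8] (which also tracks taint source information); the value/taint record layout and the helper names are assumptions made for illustration. Each value carries a taint mask with one bit per data bit.

```javascript
// A tracked 8-bit value: `value` holds the data, `taint` is a bit mask
// in which a 1 marks a data bit that depends on tainted input.

// Byte-level rule: the whole result is tainted if any operand is tainted.
function byteLevelAnd(a, b) {
  return { value: a.value & b.value, taint: (a.taint || b.taint) ? 0xff : 0 };
}
function byteLevelShr(a, n) {
  return { value: a.value >>> n, taint: a.taint ? 0xff : 0 };
}

// Bit-level rule for AND: a result bit is tainted only if it can actually
// vary with tainted input, i.e. both operand bits are tainted, or one is
// tainted and the other is an untainted 1.
function bitLevelAnd(a, b) {
  const taint = (a.taint & b.taint) | (a.taint & b.value) | (b.taint & a.value);
  return { value: a.value & b.value, taint: taint };
}
// Bit-level rule for a logical right shift: taint bits move with the data.
function bitLevelShr(a, n) {
  return { value: a.value >>> n, taint: a.taint >>> n };
}

const input = { value: 0xab, taint: 0xff }; // fully tainted input byte
const mask = { value: 0x0f, taint: 0x00 };  // untainted constant

// (input & 0x0f) >>> 4 is always 0, whatever the input is.
const byteResult = byteLevelShr(byteLevelAnd(input, mask), 4);
const bitResult = bitLevelShr(bitLevelAnd(input, mask), 4);
console.log(byteResult, bitResult);
```

Here the byte-level analysis reports the constant result as tainted (over-tainting), while the bit-level analysis correctly reports it as untainted.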

2.  Generic  techniques  for  deobfuscation  of  executable  code  [9]. 

This work discusses a generic approach for deobfuscation of obfuscated executable code. The
approach described does not make any assumptions about the nature of the obfuscations
used, but instead uses semantics-preserving program transformations to simplify away
obfuscation code. We have applied a prototype implementation of our ideas to a variety of
different kinds of obfuscation, including emulation-based obfuscation, emulation-based
obfuscation with runtime code unpacking, and return-oriented programming. Our experimental
results are encouraging and suggest that this approach can be effective in extracting the
internal logic from code obfuscated using a variety of obfuscation techniques, including tools
such as Themida that previous approaches could not handle.

Figure  4  shows  the  effects  of  deobfuscation  on  several  emulation-obfuscated  malware  samples. 
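The general idea of simplifying away obfuscation code with semantics-preserving rewrites can be sketched on a toy expression language. This is only an illustration of the principle, not the transformations used in [9], which operate on executable code; the expression encoding, the simplify and evaluate helpers, and the choice of 32-bit integer semantics are all assumptions. Each rewrite rule (constant folding, x + 0, x * 1, x ^ 0) leaves the computed value unchanged, so the simplified expression is equivalent to the original.

```javascript
// Expressions are numbers, variable names (strings), or [op, left, right].
// All arithmetic is assumed to use 32-bit integer semantics.
function simplify(e) {
  if (!Array.isArray(e)) return e;
  const op = e[0];
  const l = simplify(e[1]);
  const r = simplify(e[2]);
  // Constant folding: both operands are known, so compute the result now.
  if (typeof l === 'number' && typeof r === 'number') {
    if (op === '+') return (l + r) | 0;
    if (op === '*') return Math.imul(l, r);
    if (op === '^') return l ^ r;
  }
  // Algebraic identities that leave the value unchanged.
  if ((op === '+' || op === '^') && r === 0) return l;
  if ((op === '+' || op === '^') && l === 0) return r;
  if (op === '*' && r === 1) return l;
  if (op === '*' && l === 1) return r;
  return [op, l, r];
}

// Reference evaluator, used to check that simplification preserves meaning.
function evaluate(e, env) {
  if (typeof e === 'number') return e;
  if (typeof e === 'string') return env[e] | 0;
  const a = evaluate(e[1], env);
  const b = evaluate(e[2], env);
  if (e[0] === '+') return (a + b) | 0;
  if (e[0] === '*') return Math.imul(a, b);
  if (e[0] === '^') return a ^ b;
  throw new Error('unknown operator ' + e[0]);
}

// ((x + (3 + -3)) * 1) ^ (5 ^ 5): padded with operations that cancel out.
const obfuscated = ['^', ['*', ['+', 'x', ['+', 3, -3]], 1], ['^', 5, 5]];
const simplified = simplify(obfuscated);
console.log(simplified, evaluate(obfuscated, { x: 42 }), evaluate(simplified, { x: 42 }));
```

Because the rules are stated in terms of what the code computes rather than how a particular obfuscator produced it, the same rule set applies regardless of which tool inserted the padding.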

3. Information flow based dynamic analysis techniques for identifying self-checksum-based
anti-tamper defenses in software [6, 7].

Software self-checksumming is widely used as an anti-tampering mechanism for protecting
intellectual property and deterring piracy. This makes it important to understand the strengths
and weaknesses of various approaches to self-checksumming. This work investigates a
dynamic information-flow-based attack that aims to identify and understand self-checksumming
behavior in software. Our approach is applicable to a wide class of self-checksumming defenses
and the information obtained can be used to determine how the checksumming defenses may
be bypassed. Experiments using a prototype implementation of our ideas indicate that our
approach can successfully identify self-checksumming behavior in (our implementations of)
proposals from the research literature.
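A minimal sketch of how information flow can expose self-checksumming follows. It is not the analysis from [6, 7]; the trace event format, the register names, and the code-section bounds are hypothetical. The assumption is that a program which reads its own code bytes as data, and then lets values derived from them influence a conditional branch, is a candidate for a checksum check. The sketch taints every value loaded from the code section, propagates taint through computations, and reports branches whose condition is tainted.

```javascript
// Assumed bounds of the program's code section.
const CODE_START = 0x1000;
const CODE_END = 0x2000;

// Hypothetical instruction trace of a tiny register machine.
const machineTrace = [
  { op: 'load', dst: 'r1', addr: 0x1004 },             // reads its own code bytes
  { op: 'load', dst: 'r2', addr: 0x8000 },             // ordinary data read
  { op: 'add', dst: 'r3', srcs: ['r3', 'r1'] },        // accumulates a checksum
  { op: 'cmp', dst: 'flags', srcs: ['r3', 'r4'] },     // compares with expected value
  { op: 'branch', pc: 0x1040, srcs: ['flags'] },       // depends on code bytes
  { op: 'cmp', dst: 'flags', srcs: ['r2', 'r5'] },
  { op: 'branch', pc: 0x1060, srcs: ['flags'] },       // depends on data only
];

// Returns the addresses of branches whose outcome depends on code bytes.
function findChecksumBranches(trace) {
  const tainted = new Set(); // registers holding values derived from code bytes
  const suspicious = [];
  for (const ev of trace) {
    if (ev.op === 'load') {
      const fromCode = ev.addr >= CODE_START && ev.addr < CODE_END;
      if (fromCode) tainted.add(ev.dst); else tainted.delete(ev.dst);
    } else if (ev.op === 'branch') {
      if (ev.srcs.some((r) => tainted.has(r))) suspicious.push(ev.pc);
    } else {
      // Any other operation: the result is tainted if any source is tainted.
      if (ev.srcs.some((r) => tainted.has(r))) tainted.add(ev.dst);
      else tainted.delete(ev.dst);
    }
  }
  return suspicious;
}

const checksumBranches = findChecksumBranches(machineTrace);
console.log(checksumBranches.map((pc) => '0x' + pc.toString(16)));
```

Only the first branch is reported, since the second one depends solely on ordinary data. The reported branch locations are the kind of information that could then be examined to understand the defense.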


3.  Semantics-based  Approaches  to  Malware  Analysis.  This  work  uses  the  theoretical 
framework  of  abstract  interpretation  to  investigate  foundational  issues  in  semantics-based  malware 
detection. 
Figure 4: Effects of obfuscation and deobfuscation on the control flow graphs of some malware
samples


1.  Semantics-based  approaches  to  identifying  metamorphic  malware  [1]. 

Metamorphic code includes self-modifying, semantics-preserving transformations to exploit
code diversification. The impact of metamorphism is growing in security and code protection
technologies, both for preventing malicious host attacks, e.g., in software diversification for
IP and integrity protection, and in malicious software attacks, e.g., in metamorphic malware
that modifies its own code in order to foil detection systems based on signature matching.
This work considers the problem of automatically extracting metamorphic signatures
from metamorphic code. We introduce a semantics for self-modifying code, called phase
semantics, and prove its correctness by showing that it is an abstract interpretation of the
standard trace semantics. Phase semantics precisely models metamorphic code behavior
by providing a set of traces of programs that correspond to the possible evolutions of
the metamorphic code during execution. We show that metamorphic signatures can be
automatically extracted by abstract interpretation of the phase semantics. In particular, we
introduce the notion of regular metamorphism, where the invariants of the phase semantics
can be modeled as finite state automata representing the code structure of all possible
metamorphic changes of a metamorphic code, and we provide a static signature extraction
algorithm for metamorphic code where metamorphic signatures are approximated in regular
metamorphism.